I want to share a take on grammar‑constrained generation I’ve been working on for a while.
This is a preliminary writeup: enough detail to communicate the core idea—use GLR/GSS for commit, and compute masks by running a precompiled weighted automaton over the stack—and timestamp it. I’ll likely follow up with a post on the less glamorous part: all the tricks you need to compile these automata without exploding time/memory.
You have a grammar (often a JSON Schema, or a language-ish CFG) and you want the LLM to produce grammatically valid output.
At each decoding step, the LLM outputs logits over the vocabulary (V). You sample a token, append it to the output, and repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask (M \subseteq V) of allowed tokens, and apply it to the logits.
No more broken JSON.
Think of constrained decoding as two functions:
- get_mask(state) returns a mask (M \subseteq V) over the LLM vocabulary.
- commit(state, token) updates the constraint state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask
apply mask
sample token
commit token to model and constraint
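As a toy sketch of that loop body, assuming a greedy sampler and treating get_mask/commit as injected callbacks (the stub bodies below are invented for illustration):

```python
import math

def decode_step(logits, state, get_mask, commit):
    """One constrained decoding step: mask invalid tokens, pick one, commit."""
    allowed = get_mask(state)                      # mask M ⊆ V as a set of ids
    masked = [x if v in allowed else -math.inf     # apply mask to logits
              for v, x in enumerate(logits)]
    tok = max(range(len(masked)), key=masked.__getitem__)  # greedy "sample"
    commit(state, tok)                             # advance the constraint state
    return tok

# Toy constraint that only ever allows token ids 0 and 2:
state = []
tok = decode_step([0.1, 5.0, 0.3, 2.0], state,
                  get_mask=lambda s: {0, 2},
                  commit=lambda s, t: s.append(t))
print(tok)  # 2: the highest-logit *allowed* token, despite token 1's higher logit
```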
LLM inference is batched: many user requests share the GPU. Mask generation sits directly on the critical per-token decoding path.
If you want ~1k tokens/sec, you have ~1ms to produce a mask. And if you miss even a single one—if a mask takes 2ms instead of 1ms—it screws up scheduling. The GPU waits on the straggler mask, and suddenly your p99 (or p99.9) latency is trashed.
Masks can’t be “fast on average.” They have to be fast every single step.
commit is incremental parsing (GLR + GSS)

The commit op is basically incremental parsing: you’re incrementally parsing a growing prefix.
Many battle‑tested incremental parsers (tree‑sitter, Lezer, etc.) use Generalized LR (GLR) parsing with a graph‑structured stack (GSS).

Glossary

- GLR: Generalized LR parsing; where the grammar is ambiguous or locally unclear, the parser pursues several parse hypotheses at once.
- GSS: graph‑structured stack; a DAG that lets those hypotheses share common stack suffixes instead of duplicating whole stacks.
- Terminal: a grammar‑level symbol emitted by the lexer, e.g. --, NUMBER, IDENT.

Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token (v \in V), check whether appending it keeps the parse valid.
But you don’t append an LLM token to the parser. You append terminal symbols.
And a single LLM token is a byte string which might:

- lex into multiple terminals (e.g. "}\n" could lex as RBRACE NEWLINE),
- cover only part of a terminal (e.g. the first bytes of true).

So “try token (v)” is really “try every way (v)’s bytes could lex into some terminal sequence(s), then try advancing the parser for those terminals.”
That’s where runtime goes to die.
A common direction is: walk a trie over the vocabulary’s byte strings, simulating lexer and parser transitions as you go and pruning subtrees whose prefixes are already invalid.

This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.
But for general CFG-ish constraints (especially when you care about worst‑case), two things bite:
Even if your lexing side is deterministic (say you enforce longest‑match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.
In an LR parser, “shift terminal (t)” is often really: perform zero or more reductions (pop a rule’s right-hand side, push the goto state), and only then shift (t).
Those reduction chains are table‑driven and fast in the happy path, but in GLR they can fan out. The operations are pointer‑heavy and cache‑unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively doing speculative GLR work for a huge number of hypothetical tokens on every step.
That’s exactly the kind of work you do not want on the critical path.
Even with tries, some tokens’ byte strings correspond to long sequences of grammar terminals. Sometimes those get quickly pruned. Sometimes they don’t.
Example: Python accepts monstrosities like:

y = ----------------x

(sixteen chained unary minuses). Those sixteen hyphens may be tokenized (BPE) as a single LLM token (IDs just illustrative):

"----------------" → [28542]

But that token lexes to 16 MINUS terminals:

MINUS MINUS MINUS … MINUS (×16)
If masking “tries” that token by actually feeding all 16 MINUS terminals through the parser, you’ve just forced 16 terminal‑level shifts (plus reductions) triggered by a single vocabulary entry. That’s pathological for worst‑case mask time.
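You can check the lexing side of this claim with Python’s own tokenizer (the BPE side depends on the particular vocabulary, so it isn’t shown):

```python
import io
import tokenize

# Sixteen unary minuses: valid Python, and the lexer emits sixteen '-' tokens.
src = "y = ----------------x\n"
minuses = [t for t in tokenize.generate_tokens(io.StringIO(src).readline)
           if t.string == "-"]
print(len(minuses))  # 16
```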
You can transform grammars, memoize, prune, etc. But in my experience, if you keep “simulate parse work per candidate token” inside get_mask, you end up fighting worst‑case behavior forever.
Token validity depends on what’s on the stack. Instead of asking:
“does token (v) work on this stack?”
for each (v \in V), ask:
“given this stack, which tokens (v) work?”
That reframing is the whole move. Now we want a data structure that implements “stack → allowed tokens” quickly and predictably.
That’s where the weighted automaton comes in.
Think of a finite automaton, except transitions carry “weights”, and traversing a path combines those weights.
Here:

- a “weight” is a set of LLM tokens (think: a bitmask over V),
- combining weights along a path means intersecting those sets,
- merging alternative paths means unioning them.

So as you run the automaton on the current parse configuration, each transition filters the set of still-possible tokens, and the tokens that survive to an accepting state form the mask.

This is a “vectorized membership test” idea: instead of asking “is token (v) allowed?” once per token, a single traversal answers the question for the whole vocabulary at once.
If you fix a terminal sequence (\tau = t_1 t_2 \dots t_k), the set of LR stack configurations from which (\tau) can be consumed without error can be characterized by a finite-state device over parser states (this is closely related to the classical “viable prefixes are regular” result for LR parsing).
The weighted automaton is the operational way to use that fact at runtime without iterating over tokens.
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> token_set
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # ∩ filter tokens
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # ∪ merge paths
frontier = new
# optional early-exit: if nothing in `frontier` can be affected by deeper stack
if done(frontier):
break
return tokens_reaching_accepting(frontier)
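Here’s a runnable toy instance of that loop, with a hand-built two-transition automaton and token sets as int bitmasks (the states, stack IDs, and weights are all invented for illustration):

```python
ALL, EMPTY = 0b1111, 0          # 4-token toy vocabulary as a bitmask
START, ACC = "s0", "acc"

# TRANS[(automaton_state, stack_state_id)] -> [(next_state, weight)]
TRANS = {
    (START, 7): [(ACC, 0b0011)],   # top-of-stack parser state 7 allows tokens {0,1}
    (START, 9): [(ACC, 0b0110)],   # top-of-stack parser state 9 allows tokens {1,2}
}

def get_mask(stack_top_down):
    frontier = {START: ALL}
    accepted = EMPTY
    for sid in stack_top_down:
        new = {}
        for a_state, toks in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = toks & weight                    # ∩ filter along the path
                if t2:
                    new[a2] = new.get(a2, EMPTY) | t2 # ∪ merge paths
        accepted |= new.pop(ACC, EMPTY)   # tokens whose path already accepts
        frontier = new
        if not frontier:                  # early exit: deeper stack is irrelevant
            break
    return accepted

print(bin(get_mask([7, 42, 3])))  # 0b11: only tokens {0,1} survive
```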
Operationally, the inner loop is: read the next stack state ID, intersect each carried token set with the matching transition’s weight, drop empty sets, and union sets that land on the same automaton state.

Critically:

- no reductions are simulated,
- no parser structures are allocated or mutated,
- nothing iterates over the vocabulary.
The automaton is doing the “parser work” ahead of time; the runtime does only predictable set‑flow work.
For GLR, your parse configuration is a graph-structured stack (a DAG encoding many stacks sharing suffixes). You want:
a token is valid if it’s valid on any surviving parse path.
So you run essentially the same ∩/∪ propagation, but over a product of two structures: the GSS (which stack paths exist) and the automaton (which state each path has reached).
When GSS paths merge, you union token sets (since any path suffices). When automaton paths merge, you union too.
One practical implication: to make this fast, you don’t want to recompute “automaton-frontier maps” per path. You want a representation where GSS nodes can carry and merge accumulated values when nodes merge. (This is the part that becomes implementation-heavy; it’s doable, but the data structure matters.)
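Conceptually (setting aside the shared-value optimization just mentioned, and collapsing the automaton into per-edge weights for brevity), “valid on any path” is ∩ along each path and ∪ across paths. A toy sketch with invented weights:

```python
from functools import reduce

# Toy GSS with two paths sharing a suffix:
#   A -> C -> end   and   B -> C -> end.
# Edge weights are token bitmasks; ∩ along a path, ∪ across paths.
EDGE_WEIGHT = {("A", "C"): 0b0111, ("B", "C"): 0b1110, ("C", "end"): 0b0110}
PATHS = [["A", "C", "end"], ["B", "C", "end"]]

def path_mask(path):
    """Intersect edge weights along one stack path."""
    return reduce(lambda acc, e: acc & EDGE_WEIGHT[e],
                  zip(path, path[1:]), 0b1111)

# A token is allowed iff some surviving parse path allows it.
mask = reduce(lambda acc, p: acc | path_mask(p), PATHS, 0)
print(bin(mask))  # 0b110
```

A real implementation would not enumerate paths like this; as noted above, it carries and merges accumulated values at GSS nodes.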
At this point “the weighted automaton” may still sound like a black box. The important thing is: it’s not runtime magic; it’s a compiled artifact.
What we want to precompute is:
given the current lexer+parser configuration (represented by active lexer states + a stack/GSS), which LLM tokens could extend the parse without immediately failing?
There are two distinct sources of complexity:

- the lexer side: how a vocabulary token’s bytes map to grammar terminal sequences,
- the parser side: how a terminal sequence interacts with the parse stack.

The structure we compile is essentially a composition of two automata families:

- a Terminal DWA, relating a token’s bytes to the terminal sequences they could emit,
- per-terminal parser templates, describing the stack behavior needed to consume each terminal,

composed into a final deterministic weighted automaton (Parser DWA) that reads parser stack state IDs and outputs token sets directly.
An LLM token is a byte string. A lexer consumes a byte stream and emits grammar terminals, but in an incremental setting the two don’t line up: a token’s bytes can begin while the lexer is already partway through a terminal, and can end leaving it partway through another.
So the mapping “vocabulary token → terminals” is not a single lookup. What you really want is:
from a given lexer configuration (\ell), what terminal sequences could be emitted while consuming the bytes of a single vocabulary token (v)?
The Terminal DWA is a deterministic weighted automaton that compactly represents that relation: it reads a token’s bytes and carries, as weights, the terminal sequences those bytes could correspond to.

A practical way to build it: take the product of a trie over the vocabulary’s byte strings with the lexer automaton, recording emitted terminals as weights along the way, then determinize.
Important structural property: the Terminal DWA is acyclic.
This isn’t a deep theorem; it’s a consequence of the construction: you’re consuming a finite token from a finite trie. You can’t loop while consuming additional bytes of the same token. That acyclicity will matter later because it gives you a hard bound on the amount of “within-token” structure the compiled system can traverse.
Also note what I didn’t say: the Terminal DWA doesn’t need to tell you “the lexer state you end up in after the token.” For masking you only need to know whether there exists at least one valid lexing path for the token (and which terminal sequence(s) it could correspond to). The actual evolution of lexer ambiguity gets handled by commit.
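To make the token→terminals relation concrete, here’s a toy enumeration of it for a two-terminal lexer (MINUS = "-", ARROW = "->"); the real construction compiles this relation into the Terminal DWA rather than enumerating it at runtime, and also tracks provisional in-progress terminals, which this sketch ignores:

```python
def lex_seqs(s):
    """All ways the byte string s splits into complete MINUS/ARROW terminals.
    (Toy: ignores provisional trailing terminals and longest-match.)"""
    if s == "":
        return [()]
    out = []
    if s.startswith("->"):
        out += [("ARROW",) + rest for rest in lex_seqs(s[2:])]
    if s.startswith("-"):
        out += [("MINUS",) + rest for rest in lex_seqs(s[1:])]
    return out

print(lex_seqs("-->"))  # [('MINUS', 'ARROW')]
print(lex_seqs("--"))   # [('MINUS', 'MINUS')]
```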
Now suppose we have a grammar terminal (t). What does the parser do?
In LR parsing, “consume (t)” is generally: perform zero or more reductions (each pops a rule’s right-hand side off the stack and pushes a goto state), then shift (t) itself;
and in GLR there can be multiple viable reduction sequences.
The key observation is: you can precompute, for each terminal (t), a finite-state object that reads a stack suffix (parser state IDs from top downward) and describes whether (t) can be legally processed (including any reduction behavior relevant to enabling that shift).
This idea is closely related to the reduction-graph style optimizations in Aycock & Horspool, “Even Faster Generalized LR Parsing” (2001): rather than “discover” reduction behavior by repeatedly simulating it on a pointer-heavy GSS, you structure it so the work is table-/automaton-driven.
When you have to compose “consume (t_1), then (t_2), …” you need to reason about how stack actions compose.
A convenient algebra for this is the polycyclic monoid: for each stack symbol there is a “push” generator and a matching “pop” generator; popping the symbol you just pushed cancels to the identity, and popping any other symbol collapses to an absorbing zero element (0), meaning “impossible.”
At a high level: this lets you symbolically carry “what stack suffix is required” while composing templates, while aggressively simplifying via cancellation and pruning impossible compositions early via the (0) element.
This isn’t “cute algebra”—it’s the thing that makes the composition tractable, because it turns “does this sequence of terminals line up with some stack behavior?” into mechanical simplification + pruning.
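A minimal sketch of that mechanical simplification, representing a stack effect as a (pops, pushes) pair with an absorbing ZERO (this tuple representation is invented here for illustration; the actual compiler works over automata, not bare tuples):

```python
ZERO = None  # absorbing element: "this composition is impossible"

def compose(e1, e2):
    """Apply stack effect e1, then e2; cancel e1's pushes against e2's pops.
    An effect is (pops, pushes): symbols popped off the top, then pushed."""
    if e1 is ZERO or e2 is ZERO:
        return ZERO
    pops1, pushes1 = list(e1[0]), list(e1[1])
    pops2, pushes2 = list(e2[0]), list(e2[1])
    while pushes1 and pops2:
        if pushes1[-1] != pops2[0]:
            return ZERO        # mismatched symbol: prune this composition
        pushes1.pop()          # cancellation: pop(s) after push(s) = identity
        pops2.pop(0)
    return (pops1 + pops2, pushes1 + pushes2)

# push A, push B — then pop B, push C  ==>  net effect: push A, push C
print(compose(([], ["A", "B"]), (["B"], ["C"])))  # ([], ['A', 'C'])
# push A — then pop B  ==>  impossible
print(compose(([], ["A"]), (["B"], [])))          # None
```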
If you allow arbitrary grammars, reduction behavior can contain cycles (intuitively: you can keep reducing in ways that don’t make progress on the relevant lookahead), which would imply “in principle you might need to inspect unbounded stack depth.”
In practice, for “programming-language-ish” grammars and for JSON-schema-derived grammars, you can normalize away the patterns that create cyclic reduction behavior. Aycock & Horspool’s analysis implies (and in my experience, you can make explicit) that after certain transformations, the per-terminal reduction/viability templates can be made acyclic.
I’ll be vague here because the transformation details are the engineering-heavy part I want to write up separately, but the punchline is: after the right grammar transformations, each per-terminal template only ever needs to inspect a bounded suffix of the stack.
Now we compose the Terminal DWA with the per-terminal parser templates, to get a Parser DWA that:

- reads parser stack state IDs from the top downward,
- and directly outputs the set of vocabulary tokens that could be consumed next.
When a vocabulary token emits a terminal sequence (t_1 t_2 \dots t_k), the composed object is effectively chaining the stack-effects associated with each (t_i).
This is exactly where the polycyclic monoid representation earns its keep: the pushes produced by one terminal’s stack effect cancel against the pops required by the next, and any mismatch collapses to (0) and is pruned at compile time.
This pruning/cancellation is what lets you compile “a token that emits many terminals” into something that still only depends on a bounded stack suffix, rather than re-simulating a long reduction chain at runtime.
Two properties line up: the Terminal DWA is acyclic (by construction), and after the normalization above the per-terminal templates are acyclic too.
Under composition, you get a Parser DWA that is also acyclic.
And that has a very concrete runtime implication:
There exists a fixed depth (D) such that get_mask never needs to inspect more than the top (D) parser states (per active lexer configuration), regardless of how deep the real parse stack is.
That’s the kind of “worst‑case bounded work” statement that actually matters for p99.9.
During compilation, templates need enough symbolic machinery to compose terminal sequences correctly—i.e., to ensure “requirements produced earlier in the sequence” match “requirements consumed later,” with cancellation and mismatch pruning.
But at runtime, get_mask doesn’t need to produce the next parser stack. It only needs to answer:
does there exist a legal way to consume this token next?
The real parser/lexer (in commit) is responsible for actually performing shifts/reductions and maintaining the GSS.
So once compilation is finished, you can project away the parts of the compiled state that were only needed for symbolic stack-effect composition, and keep a machine that is “read-only over the current stack”: it consumes stack state IDs, filters token sets, and produces the mask.
You might ask: “do you really scan the whole stack every step?”
No.
With an acyclic Parser DWA, there’s a hard bound (D) on how many stack symbols any accepting path can consume. So get_mask can look at the top (D) stack entries (per GSS path) and stop.

You can still add an early-exit optimization on top: stop as soon as the frontier can no longer change the answer (the done(...) check in the pseudocode).
But the important point isn’t “masks stabilize quickly”; it’s that the compiled structure gives you a worst‑case bound on the amount of stack information masking can depend on.
The term “token” is overloaded:

- LLM tokens: vocabulary entries, i.e. arbitrary byte strings produced by BPE;
- grammar terminals: symbols emitted by the lexer (IDENT, "{", NUMBER, keywords, etc.).

These don’t align.
Even with longest‑match lexing, longest‑match is forward‑looking: you can’t know a match is final until you see what comes next.
Classic example: "+" followed by "+" should yield INCREMENT ("++"), not two PLUS.

In a streaming setting, after you see the first "+", you’re in a state where the input so far might already be a complete PLUS, or might be the first byte of INCREMENT.

Every incremental parser/lexer has to represent that uncertainty somehow. The standard move is: keep both hypotheses alive and prune later.
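A toy sketch of the branch-and-prune pattern for exactly this PLUS/INCREMENT case, with hypotheses as (emitted, pending) pairs (a real implementation would also apply longest-match pruning, which this sketch omits):

```python
def step(hyps, byte):
    """Advance every lexing hypothesis by one byte; drop dead ones."""
    new = set()
    for emitted, pending in hyps:
        s = pending + byte
        if s == "+":
            new.add((emitted + ("PLUS",), ""))  # treat '+' as a finished PLUS...
            new.add((emitted, "+"))             # ...or as the start of "++"
        elif s == "++":
            new.add((emitted + ("INCREMENT",), ""))
    return new

h = step({((), "")}, "+")        # two live hypotheses after one byte
h = step(h, "+")                 # the second '+' resolves one of them
print((("INCREMENT",), "") in h) # True
```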
Operationally, my implementation keeps a set of active lexer configurations, emits terminals provisionally, and prunes hypotheses as later bytes resolve the ambiguity.
I sometimes call the provisional ones “inhibited terminals” as shorthand, but the underlying idea is not novel: it’s the usual incremental-lexing branch-and-prune pattern. It composes nicely with GLR+GSS, because “fork and later prune” is already the parser’s native operation.
Practically, the constraint state you carry around is not just “parser stack(s)”; it’s “(parser stack(s), active lexer configurations)”. That’s why get_mask is conditioned on both, and why the Parser DWA effectively has multiple start states (one per active lexer configuration).
commit(state, token): parser work lives here

commit uses real lexer + GLR machinery: it lexes the chosen token’s bytes (updating the active lexer configurations), performs the actual shifts and reductions, and maintains the GSS.
This can still have pathological worst cases in theory (GLR is GLR), but for practical grammars it tends to behave well, and—importantly—you’re only doing it once per chosen token, not for thousands of hypothetical tokens.
get_mask(state): mask work is “read-only set flow”

get_mask uses the precompiled weighted automaton: it reads stack state IDs off the current GSS and flows token sets through intersections and unions, without mutating any parse state.
So you get a clean split:
- commit is responsible for advancing the real incremental parse state,
- get_mask is responsible for cheaply answering “which vocabulary tokens could be next?”

Mask computation becomes a bounded, read-only walk over the structures you already maintain.
Critically, at runtime:
Work scales with the size of the parse configuration you’re already maintaining (stack/GSS), not with vocabulary size or token byte weirdness.
All of this shifts cost to compile time.
Precompiling into the Parser DWA involves determinization, weight pushing, and aggressive simplification in a setting where weights are large token sets and merges do unions.
If you’re not careful, big grammars can blow up in memory/time. Getting compilation to behave took most of my engineering effort.
That’s probably the next post.
If you care about predictable p99.9 latency, the main takeaway is simple:
don’t make get_mask do speculative parsing—compile that speculation away, and make masking a bounded, cache-friendly walk of set intersections/unions over the existing parse structure.